Add CUDA Target, Runtime, and Kernel CI Support #1

Merged
sunnycase merged 25 commits into master from feature/cuda
Jul 2, 2026
@sunnycase sunnycase commented Jun 25, 2026

Owner

Summary

This PR adds end-to-end CUDA support for the nncase NTT path and native runtime. It introduces a CUDA target/module compiler, CUDA runtime module loading and execution support, CUDA-aware NTT runtime primitives, CUDA kernel tests, and a dedicated Linux CUDA CI job for running those tests separately from the regular CPU/macOS compiler jobs.

Motivation

  • Enable nncase to compile NTT-generated kernels for CUDA and execute them through the native runtime, instead of stopping at code generation.
  • Keep CPU and CUDA generated modules on compatible runtime/operator ABI boundaries while allowing CUDA-specific device entry points and launch behavior.
  • Catch CUDA-specific regressions in CI without making ordinary Linux/macOS compiler jobs depend on GPU availability.
  • Fix issues uncovered while enabling CUDA tests, including CUDA toolkit discovery, device-callable scalar helpers, generated module ABI handling, and reduce-axis normalization in the NTT vectorization/lowering path.
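The "device-callable scalar helpers" mentioned in the last bullet can be pictured with a minimal sketch. It is not the code from this PR: the macro name `SKETCH_HOST_DEVICE` and the function `half_bits_to_float` are hypothetical, and the only assumption about the toolchain is that CUDA compilers define `__CUDACC__`. The idea is that one helper compiles unchanged for the CPU build and becomes callable from device code in the CUDA build.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical macro (not from the PR): expands to CUDA execution-space
// qualifiers when compiled as CUDA, and to nothing in a plain C++ build.
#if defined(__CUDACC__)
#define SKETCH_HOST_DEVICE __host__ __device__
#else
#define SKETCH_HOST_DEVICE
#endif

// Hypothetical helper: widen raw IEEE 754 binary16 bits to a float.
SKETCH_HOST_DEVICE inline float half_bits_to_float(std::uint16_t h) {
    const std::uint32_t sign = (static_cast<std::uint32_t>(h) & 0x8000u) << 16;
    std::uint32_t exp = (static_cast<std::uint32_t>(h) >> 10) & 0x1Fu;
    std::uint32_t man = static_cast<std::uint32_t>(h) & 0x3FFu;
    std::uint32_t bits = 0;

    if (exp == 0) {
        if (man == 0) {
            bits = sign; // signed zero
        } else {
            // Subnormal half: shift the mantissa until its implicit bit appears.
            std::uint32_t shift = 0;
            while ((man & 0x400u) == 0) {
                man <<= 1;
                ++shift;
            }
            man &= 0x3FFu;
            exp = 113u - shift; // rebias: 127 - 15 + 1 - shift
            bits = sign | (exp << 23) | (man << 13);
        }
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (man << 13); // infinity or NaN
    } else {
        bits = sign | ((exp + 112u) << 23) | (man << 13); // normal: rebias by 127 - 15
    }

    float result;
    std::memcpy(&result, &bits, sizeof(result)); // bit-cast without aliasing issues
    return result;
}
```

For example, `half_bits_to_float(0x3C00)` yields `1.0f` on the host; under a CUDA compiler the same definition would also be usable inside a kernel.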

Implementation

  • Added CUDA target plumbing in Nncase.Modules.NTT, including CUDATarget, CUDAModuleCompiler, target abstraction cleanup, and CUDA-aware C/CMake generation.
  • Added native CUDA runtime support with CUDA runtime module/function classes, loader integration, runtime CMake wiring, and ENABLE_CUDA_RUNTIME gating.
  • Added NTT CUDA runtime support for topology, remote tensors, distributed operations, vector ops, profiling, and CUDA runtime entry points.
  • Updated NTT kernels and runtime utilities so generated code can compile for both CPU and CUDA, including device-callable scalar conversions/operators for half and related scalar types.
  • Normalized negative reduce axes during NTT vectorization/lowering instead of IR construction, preserving IR semantics while fixing CUDA reduce vectorization cases.
  • Added CUDA kernel test coverage through UnitTestCUDAKernels and enabled it in CI with a dedicated test-x86_64-linux-cuda job.
  • Kept UnitTestCUDAKernels excluded from the regular compiler test job so CPU-only Linux/macOS jobs remain independent of CUDA runtime availability.
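The negative reduce-axis normalization described in the list above can be illustrated with a short sketch. The actual change lives in the compiler's NTT vectorization/lowering path; the C++ function below, `normalize_reduce_axes`, is a hypothetical stand-in that only shows the rule being applied: an axis `a` in `[-rank, rank)` maps to `a + rank` when negative, and anything outside that range is rejected.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not the PR's code): map possibly negative reduce axes
// to non-negative indices for a tensor of the given rank.
inline std::vector<std::size_t>
normalize_reduce_axes(const std::vector<std::int64_t> &axes, std::size_t rank) {
    const std::int64_t r = static_cast<std::int64_t>(rank);
    std::vector<std::size_t> normalized;
    normalized.reserve(axes.size());

    for (std::int64_t axis : axes) {
        // Valid axes lie in [-rank, rank); everything else is an error.
        if (axis < -r || axis >= r) {
            throw std::out_of_range("reduce axis out of range");
        }
        // Negative axes count from the last dimension: -1 is rank - 1.
        normalized.push_back(static_cast<std::size_t>(axis < 0 ? axis + r : axis));
    }
    return normalized;
}
```

With a rank-3 tensor, axes `{-1, 0}` become `{2, 0}`. Doing this at lowering time rather than at IR construction means the IR can keep the axes as the user wrote them, which matches the intent stated in the list.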

Validation

  • git diff --check
  • YAML parsing for .github/workflows/compiler-build.yml
  • dotnet build modules/Nncase.Modules.NTT/Nncase.Modules.NTT.csproj -c Release --no-restore
  • Rebuilt the native runtime locally with clang, CUDA 12.8, and ENABLE_CUDA_RUNTIME=ON.
  • Verified the installed native runtime exposes CUDA runtime support and links against CUDA runtime libraries.
  • Verified generated CUDA modules compile locally with clang++ and CUDA 12.8 for the reduce/vectorization repro cases.
  • dotnet test src/Nncase.Tests/Nncase.Tests.csproj -c Release --no-build --no-restore --filter "FullyQualifiedName~Nncase.Tests.TargetTest.UnitTestCUDAKernels.TestVectorizeReduce": 8/8 passed locally.
  • Existing PR compiler/runtime/code-format checks passed before the dedicated CUDA CI job was enabled; the latest CI run is in progress with the new CUDA check included.

Limitations

  • Full UnitTestCUDAKernels execution requires a Linux runner with an NVIDIA GPU, CUDA toolkit, nvcc, clang/clang++, and the labels self-hosted, linux, x64, cuda. GitHub-hosted CPU runners cannot execute these runtime tests.
  • The dedicated CUDA CI job currently uses CUDA architecture 120, matching the local validation environment.
  • The general compiler test job still intentionally excludes UnitTestCUDAKernels; CUDA runtime tests are expected to run only in the dedicated CUDA job.
  • HuggingFace importer tests still depend on an available HF_HOME cache/configuration in CI and should be revisited separately from the CUDA runtime work.

Backlog

Future CUDA follow-up work is tracked in #2.

@sunnycase sunnycase closed this Jun 25, 2026
@sunnycase sunnycase reopened this Jun 25, 2026
github-actions Bot commented Jun 29, 2026

Test Results

3 318 tests: 3 318 passed ✅, 0 skipped 💤, 0 failed ❌
5 suites, 5 files, run time 1h 54m 25s ⏱️

Results for commit 49f5205.

♻️ This comment has been updated with latest results.

@sunnycase sunnycase marked this pull request as ready for review July 1, 2026 06:20
@sunnycase sunnycase changed the title from "[codex] Add CUDA support" to "Add CUDA Target, Runtime, and Kernel CI Support" Jul 2, 2026
@sunnycase sunnycase mentioned this pull request Jul 2, 2026
4 tasks
@sunnycase sunnycase merged commit ba3b7d6 into master Jul 2, 2026
25 of 27 checks passed
@sunnycase sunnycase deleted the feature/cuda branch July 2, 2026 02:50
sunnycase added a commit that referenced this pull request Jul 2, 2026